Multiple Linear Regression Models

STA 101 - Summer I 2022

Raphael Morsomme

Welcome

Announcements

  • assignments

Recap

  • simple linear regression model \[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} \]
  • residuals
  • least-square estimates
  • parameter interpretation
  • model comparison with \(R^2\)
  • outliers

Outline

  • multiple linear regression
  • feature engineering
  • model comparison
  • predictive performance

Multiple linear regression

Remember, to improve our initial model with \((\beta_0, \beta_1) = (1, 1.3)\), we could (i) find better estimates, or (ii) use additional predictors.

  • for (i), the least-square estimates are usually very good
  • for (ii), we use a multiple linear regression model

For instance, to predict hwy we could use more variables than just cty.

The mpg data set

d <- ggplot2::mpg
head(d, n = 4)
# A tibble: 4 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~

Instead of fitting a model with only cty or only displ, we could fit a model with both predictors!

Linear regression with 2 predictors

The model equation is \[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} + \beta_2 \text{displ} \]

We can find the least-square estimates (minimizing the SSR [SSE]) with the lm command in R.

lm(hwy ~ cty + displ, data = d)

Call:
lm(formula = hwy ~ cty + displ, data = d)

Coefficients:
(Intercept)          cty        displ  
    1.15145      1.32914     -0.03432  

which gives the following regression equation

\[ \text{hwy} \approx 1.15145 + 1.32914 \text{cty} - 0.03432 \text{displ} \]

Special case: categorical predictor

In a regression model, categorical predictors are represented using indicator variables.

To represent a categorical predictor with \(k\) levels, we use \((k-1)\) indicator variables.

Including drv

For instance, the categorical variable drv has \(k=3\) levels (4, f and r), so we can represent it with 2 indicator variables with the following model equation

\[ \text{hwy} \approx \beta_0 + \beta_1 \text{drv_f} + \beta_2 \text{drv_r} \]

Note that for a binary variable \(k=2\), so we only need \(k-1=1\) indicator variable (see previous set of slides).
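Behind the scenes, R creates these indicator variables automatically; we can inspect them with the base-R function model.matrix:

head(model.matrix(~ drv, data = d), n = 4)  # columns: (Intercept), drvf, drvr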

Using categorical predictors in R

lm(hwy ~ drv, data = d)

Call:
lm(formula = hwy ~ drv, data = d)

Coefficients:
(Intercept)         drvf         drvr  
     19.175        8.986        1.825  

The equation with the least-square estimates is \[ \text{hwy} \approx 19.175 + 8.986 \text{drv_f} + 1.825 \text{drv_r} \]

Interpreting the output

\[ \text{hwy} \approx 19.175 + 8.986 \text{drv_f} + 1.825 \text{drv_r} \]

If a new vehicle has a drv that is:

  • 4, then the indicator variables drv_f and drv_r take the value 0, and the prediction is \(\widehat{hwy} = 19.175\)
  • f, then the indicator variable drv_f takes the value 1 and drv_r the value 0, and the prediction is \(\widehat{hwy} = 19.175 + 8.986 = 28.161\)
  • r, then the indicator variable drv_f takes the value 0 and drv_r the value 1, and the prediction is \(\widehat{hwy} = 19.175 + 1.825 = 21\)

The level 4 is called the reference level.
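We can verify these three predictions with the base-R function predict:

m_drv <- lm(hwy ~ drv, data = d)
predict(m_drv, newdata = data.frame(drv = c("4", "f", "r")))
# one prediction per level; matches the values above (up to rounding)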

Group exercise - categorical predictor

  • Exercises 8.5, 8.7

  • In addition

  1. fit the first model of each exercise in R
  2. identify the baseline level of the categorical variable in each model.
library(openintro)
d_birth <- openintro::births14

Fitting a large model

Let us fit a model with cty, drv and displ.

m_large <- lm(hwy ~ cty + drv + displ, data = d)
m_large

Call:
lm(formula = hwy ~ cty + drv + displ, data = d)

Coefficients:
(Intercept)          cty         drvf         drvr        displ  
      3.424        1.157        2.158        2.360       -0.208  

Its \(R^2\) is 0.9384357.

library(broom)  # for glance() and augment()
glance(m_large)$r.squared
[1] 0.9384357

Unsurprisingly, including additional predictors makes the regression line closer to the points \(\Rightarrow\) residuals are smaller \(\Rightarrow\) SSR is smaller \(\Rightarrow\) \(R^2\) is larger.
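Indeed, since the recap's single-predictor model is nested inside m_large, its \(R^2\) cannot be larger:

glance(lm(hwy ~ cty, data = d))$r.squared  # smaller than the 0.938 above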

Fitting a larger model

Let us now include all predictors (except for model).

m_larger <- lm(hwy ~ manufacturer + displ + year + cyl + trans + drv + cty + fl + class, data = d)

Thanks to the additional predictors, the residuals are very small, making \(R^2\) close to \(1\).

glance(m_larger)$r.squared
[1] 0.9773386

We will see in the next lecture that this is not always a good sign.

Group exercise - multiple linear regression

  • Exercise 8.9

  • In addition,

  1. fit the model in R
  2. identify the type of each variable
  3. identify the baseline level of the categorical predictors
library(openintro)
d_birth <- openintro::births14

Statistics as an art – feature engineering

We saw that adding predictors to the model seems to help.

However, the variables included in the data set, e.g. displ, year, etc., may not be the most useful predictors for hwy.

Feature engineering refers to the creation of new predictors from existing ones.

This is where your understanding of the data, scientific knowledge, etc., makes a big difference.

Transforming a variable

Consider the predictor displ

library(ggplot2)
ggplot(d) +
  geom_point(aes(displ, hwy))

The relation between displ and hwy is not exactly linear.

Let us include the predictor \(\dfrac{1}{\text{displ}}\) to capture this nonlinear relation.

The model equation is \[ \text{hwy} \approx \beta_0+ \beta_1 \text{displ} + \beta_2 \dfrac{1}{\text{displ}} \] The least-square coefficient estimates are

library(dplyr)  # for mutate() and %>%
d_transf <- mutate(d, displ_inv = 1/displ)
m <- lm(hwy ~ displ + displ_inv, data = d_transf)
m

Call:
lm(formula = hwy ~ displ + displ_inv, data = d_transf)

Coefficients:
(Intercept)        displ    displ_inv  
    12.2601      -0.2332      36.1456  

And the regression line captures the nonlinear relation.

augment(m) %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  geom_line(aes(y = .fitted))

The trees data set

  • Measurements of the diameter, height and volume of timber in 31 felled black cherry trees.
  • Note that the diameter (in inches) is erroneously labelled Girth in the data
  • The diameter (Girth) is measured at 4 ft 6 in above the ground.
  • Source: Atkinson, A. C. (1985) Plots, Transformations and Regression. Oxford University Press.


Combining variables

Going one step further, we can combine existing variables to create a new predictor.

Consider the trees data set

d_tree <- datasets::trees
head(d_tree)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

where Girth indicates the tree diameter (twice the radius) in inches.

Now, suppose we want to estimate Volume from Height and Girth.

We could approximate the shape of a tree with a cylinder, whose volume is

\[ V = \pi r^2 h \]

This suggests approximating the volume of a tree with the model

\[ \text{Volume} \approx \beta_1 \left(\dfrac{\text{Girth}}{2}\right)^2 \text{Height} \]

We simply need to create a new variable corresponding to \(\left(\dfrac{\text{Girth}}{2}\right)^2 \text{Height}\). Note that we first need to convert the variable Girth into feet to ensure that all variables are measured in the same units.

d_tree_comb <- d_tree %>%
  mutate(
    Girth_ft    = Girth / 12,
    radius      = Girth_ft / 2,
    r_squared   = radius^2,
    r2h = r_squared * Height
  )
head(d_tree_comb)
  Girth Height Volume  Girth_ft    radius r_squared       r2h
1   8.3     70   10.3 0.6916667 0.3458333 0.1196007  8.372049
2   8.6     65   10.3 0.7166667 0.3583333 0.1284028  8.346181
3   8.8     63   10.2 0.7333333 0.3666667 0.1344444  8.470000
4  10.5     72   16.4 0.8750000 0.4375000 0.1914062 13.781250
5  10.7     81   18.8 0.8916667 0.4458333 0.1987674 16.100156
6  10.8     83   19.7 0.9000000 0.4500000 0.2025000 16.807500

m <- lm(Volume ~ r2h - 1, data = d_tree_comb)
summary(m)

Call:
lm(formula = Volume ~ r2h - 1, data = d_tree_comb)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6696 -1.0832 -0.3341  1.6045  4.2944 

Coefficients:
    Estimate Std. Error t value            Pr(>|t|)    
r2h  1.21427    0.01568   77.44 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.455 on 30 degrees of freedom
Multiple R-squared:  0.995, Adjusted R-squared:  0.9949 
F-statistic:  5996 on 1 and 30 DF,  p-value: < 0.00000000000000022

When we do not include the intercept in a model, \(R^2\) measures something different and should therefore not be interpreted in the usual way.
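For comparison, we can refit the model with an intercept, in which case glance reports the usual \(R^2\) computed around the mean of Volume:

m_intercept <- lm(Volume ~ r2h, data = d_tree_comb)
glance(m_intercept)$r.squared  # the usual R-squared; not comparable to the 0.995 above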

A special case of data combination: interaction

Suppose you are interested in predicting the average run time (duration) of amateur jogging races.

Two variables that impact the duration are (i) the distance of the race, and (ii) the weather.

For simplicity, we measure the weather as either good (nice weather) or bad (rain, heat wave, etc).

Our model equation is therefore

\[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \text{weather_bad} \]

where \(\beta_1\) indicates the effect of an additional mile on the duration and \(\beta_2\) the effect of bad weather (probably positive).

Note that the effect of weather is fixed in this model, say “\(+5\) minutes”.

Is this reasonable? No!

The effect of weather should vary with distance. For shorter races, bad weather may add only 2 or 3 minutes, while for longer races, bad weather may increase the average duration by 10 or 15 minutes.

We capture this expected pattern using an interaction term.

\[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \text{weather_bad} + \beta_3 \text{weather_bad} \cdot \text{distance} \]
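In R, the * operator in a model formula expands to the two main effects plus their interaction. We do not have race data at hand, but as an illustration with the mpg data:

lm(hwy ~ cty * drv, data = d)  # equivalent to hwy ~ cty + drv + cty:drv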

Interpreting interactions

  • When the weather is good, the equation simplifies to \[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \cdot 0 + \beta_3 \cdot 0 \cdot \text{distance} = \beta_0 + \beta_1 \text{distance} \]

  • When the weather is bad, the equation simplifies to \[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \cdot 1 + \beta_3 \cdot 1 \cdot \text{distance} = (\beta_0 + \beta_2) + (\beta_1+\beta_3) \text{distance} \]

The new slope term is \(\beta_1+\beta_3\), meaning that the effect of an additional mile on the average duration is \(\beta_1+\beta_3\) (not \(\beta_3\)).

The effect of distance varies depending on the weather; the two variables interact.

Recap


  • simple linear regression model \[ Y \approx \beta_0 + \beta_1 X \]

  • multiple linear regression model \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]

  • categorical predictor

    • \((k-1)\) indicator variables
  • feature engineering

    • transforming variables
    • combining variables